fix(google-crc32c): release GIL for large buffers in crc32c operations#16975
Merged
Merged
Conversation
Contributor
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to release the Python Global Interpreter Lock (GIL) during CRC32C calculations for large, immutable buffers (>= 1MB) to improve performance in multi-threaded applications. This is achieved by adding a helper function _should_release_gil and wrapping the core calculation calls with PyEval_SaveThread and PyEval_RestoreThread. The review feedback points out that the inclusion of the <stdlib.h> header is unnecessary and should be removed to maintain code cleanliness.
dc41f67 to
27155d3
Compare
chandra-siri
approved these changes
May 11, 2026
nidhiii-27
pushed a commit
that referenced
this pull request
May 18, 2026
#16975) Release the Global Interpreter Lock (GIL) in `_crc32c_extend` and `_crc32c_value` when processing large, immutable byte buffers (>= 1MB). This allows other Python threads to run concurrently during expensive crc32c calculations on large chunks of data. Fixes #16923 🦕 # Unit Tests ``` .nox/check-3-13/bin/pytest -v tests ==================================================================================================================== test session starts ===================================================================================================================== platform linux -- Python 3.13.12, pytest-9.0.3, pluggy-1.6.0 -- /usr/local/google/home/zhixiangli/Cloudtop/Github/zhixiangli/google-cloud-python/packages/google-crc32c/.nox/check-3-13/bin/python cachedir: .pytest_cache rootdir: /usr/local/google/home/zhixiangli/Cloudtop/Github/zhixiangli/google-cloud-python/packages/google-crc32c configfile: pyproject.toml collected 42 items tests/test___init__.py::test_extend_w_empty_chunk PASSED [ 2%] tests/test___init__.py::test_extend_w_multiple_chunks PASSED [ 4%] tests/test___init__.py::test_extend_w_reduce PASSED [ 7%] tests/test___init__.py::test_value[-0] PASSED [ 9%] tests/test___init__.py::test_value[\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00-2324772522] PASSED [ 11%] tests/test___init__.py::test_value[\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff-1655221059] PASSED [ 14%] tests/test___init__.py::test_value[\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f-1188919630] PASSED [ 16%] tests/test___init__.py::test_value[\x1f\x1e\x1d\x1c\x1b\x1a\x19\x18\x17\x16\x15\x14\x13\x12\x11\x10\x0f\x0e\r\x0c\x0b\n\t\x08\x07\x06\x05\x04\x03\x02\x01\x00-289397596] PASSED [ 19%] tests/test___init__.py::test_value[chunk5-3650501206] PASSED [ 21%] tests/test___init__.py::test_value[\x01\xc0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x14\x00\x00\x00\x18(\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00-3650501206] PASSED [ 23%] tests/test___init__.py::TestChecksum::test_ctor_defaults[python] PASSED [ 26%] tests/test___init__.py::TestChecksum::test_ctor_defaults[cext] PASSED [ 28%] tests/test___init__.py::TestChecksum::test_ctor_explicit[python] PASSED [ 30%] tests/test___init__.py::TestChecksum::test_ctor_explicit[cext] PASSED [ 33%] tests/test___init__.py::TestChecksum::test_update[python] PASSED [ 35%] tests/test___init__.py::TestChecksum::test_update[cext] PASSED [ 38%] tests/test___init__.py::TestChecksum::test_update_w_multiple_chunks[python] PASSED [ 40%] tests/test___init__.py::TestChecksum::test_update_w_multiple_chunks[cext] PASSED [ 42%] tests/test___init__.py::TestChecksum::test_digest_zero[python] PASSED [ 45%] tests/test___init__.py::TestChecksum::test_digest_zero[cext] PASSED [ 47%] tests/test___init__.py::TestChecksum::test_digest_nonzero[python] PASSED [ 50%] tests/test___init__.py::TestChecksum::test_digest_nonzero[cext] PASSED [ 52%] tests/test___init__.py::TestChecksum::test_hexdigest_zero[python] PASSED [ 54%] tests/test___init__.py::TestChecksum::test_hexdigest_zero[cext] PASSED [ 57%] tests/test___init__.py::TestChecksum::test_hexdigest_nonzero[python] PASSED [ 59%] tests/test___init__.py::TestChecksum::test_hexdigest_nonzero[cext] PASSED [ 61%] tests/test___init__.py::TestChecksum::test_copy[python] PASSED [ 64%] tests/test___init__.py::TestChecksum::test_copy[cext] PASSED [ 66%] tests/test___init__.py::TestChecksum::test_consume_stream[python-1] PASSED [ 69%] tests/test___init__.py::TestChecksum::test_consume_stream[python-3] PASSED [ 71%] tests/test___init__.py::TestChecksum::test_consume_stream[python-5] PASSED [ 73%] tests/test___init__.py::TestChecksum::test_consume_stream[python-7] PASSED [ 76%] tests/test___init__.py::TestChecksum::test_consume_stream[python-11] PASSED [ 78%] tests/test___init__.py::TestChecksum::test_consume_stream[python-13] PASSED [ 80%] tests/test___init__.py::TestChecksum::test_consume_stream[python-48] PASSED [ 83%] tests/test___init__.py::TestChecksum::test_consume_stream[cext-1] PASSED [ 85%] tests/test___init__.py::TestChecksum::test_consume_stream[cext-3] PASSED [ 88%] tests/test___init__.py::TestChecksum::test_consume_stream[cext-5] PASSED [ 90%] tests/test___init__.py::TestChecksum::test_consume_stream[cext-7] PASSED [ 92%] tests/test___init__.py::TestChecksum::test_consume_stream[cext-11] PASSED [ 95%] tests/test___init__.py::TestChecksum::test_consume_stream[cext-13] PASSED [ 97%] tests/test___init__.py::TestChecksum::test_consume_stream[cext-48] PASSED [100%] ===================================================================================================================== 42 passed in 0.08s ===================================================================================================================== ``` # Perf Tests ## Methodology Created an independent benchmark script (shown below) that measures the time taken for `crc32c.value()` on different buffer sizes (10KB to 10MB) with 1 thread and 4 threads. ## Results | Size | Threads | Before (s) | After (s) | Speedup | | :--- | :--- | :--- | :--- | :--- | | **10KB** | 1 | 0.0001 | 0.0001 | 1.0x | | | 4 | 0.0031 | 0.0033 | 0.9x | | **500KB**| 1 | 0.0023 | 0.0022 | 1.0x | | | 4 | 0.0110 | 0.0110 | 1.0x | | **1MB** | 1 | 0.0045 | 0.0045 | 1.0x | | | 4 | 0.0241 | 0.0067 | **3.6x** | | **5MB** | 1 | 0.0291 | 0.0227 | 1.3x | | | 4 | 0.1209 | 0.0245 | **4.9x** | | **10MB** | 1 | 0.0550 | 0.0451 | 1.2x | | | 4 | 0.2091 | 0.0483 | **4.3x** | ## Conclusion - **Multi-threaded performance for large buffers (>= 1MB) improved significantly.** We see speedups of 3.6x to 4.9x when using 4 threads on buffers of 1MB and larger. - **No regression for small buffers (< 1MB)** where the GIL is not released. - **Single-threaded performance is also slightly better** or comparable, showing no negative impact from the conditional GIL release overhead. ## Code ``` import time import concurrent.futures import os import sys try: from google_crc32c import _crc32c import google_crc32c print(f"Successfully imported _crc32c: {_crc32c}") except ImportError as e: print(f"Failed to import _crc32c: {e}") print("This benchmark requires the C extension.") sys.exit(1) def benchmark_single_threaded(data, iterations=100): start = time.time() for _ in range(iterations): google_crc32c.value(data) return time.time() - start def benchmark_multi_threaded(data, num_threads=4, iterations=100): start = time.time() def worker(): for _ in range(iterations): google_crc32c.value(data) with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor: futures = [executor.submit(worker) for _ in range(num_threads)] concurrent.futures.wait(futures) return time.time() - start sizes = { "10KB": 10 * 1024, "500KB": 500 * 1024, "1MB": 1024 * 1024, "5MB": 5 * 1024 * 1024, "10MB": 10 * 1024 * 1024, } print(f"{'Size':<10} | {'Threads':<10} | {'Time (s)':<10}") print("-" * 35) # Warmup dummy_data = os.urandom(1024) google_crc32c.value(dummy_data) for name, size in sizes.items(): data = os.urandom(size) # Single threaded t_single = benchmark_single_threaded(data, iterations=100) print(f"{name:<10} | {'1':<10} | {t_single:.4f}") # Multi threaded t_multi = benchmark_multi_threaded(data, num_threads=4, iterations=100) print(f"{name:<10} | {'4':<10} | {t_multi:.4f}") ```
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Release the Global Interpreter Lock (GIL) in
_crc32c_extendand_crc32c_valuewhen processing large, immutable byte buffers (>= 1MB). This allows other Python threads to run concurrently during expensive crc32c calculations on large chunks of data.Fixes #16923 🦕
Unit Tests
Perf Tests
Methodology
Created an independent benchmark script (shown below) that measures the time taken for
crc32c.value()on different buffer sizes (10KB to 10MB) with 1 thread and 4 threads.Results
Conclusion
Code